Project Overview

In a 10-year dataset from peer-to-peer lender Prosper Loans, loans that failed within the first 10 months generated APYs ranging from -100% (total loss) to more than +75%. Why would both the best and worst performing loans be associated with the earliest closing dates? This analysis investigates the drivers of annualized percentage yield (APY) and uses real-world data to illustrate the basic intuition behind the results of early loan termination. Predicted yields based on financial formulas are compared with the data through exploratory visualizations, regression analysis and questions-and-answers.

This analysis was the final project for Udacity’s Data Analysis With R course, completed in November 2015.

Data Prep

Load libraries

Load data

Subsetting the data for analysis

The original dataset was very large with 81 variables and 113,937 loan observations. I selected a subset of variables to consider in line with the assignment instructions. During the process of exploring the data, I realized there was a need to exclude some of the original observations in order to calculate a valid estimate of the base rate for lost loans. The refined dataset then became the subject of my exploration and analysis.

In order to have consistent data through all of the plots I am running the data adjustments at the beginning of this paper rather than implementing them in successive stages as they emerged in linear time. In the following sections I will explain the data cleaning decisions according to the original flow of logic in order to document the exploratory process.

Univariate Plots

Dataset Overview

Let’s see the date distribution of loans

Why is there a gap in loan originations around 2009?
- Press reports indicate that the company had to shut down for 9 months in late 2008 due to SEC intervention

Plots of Borrower Features

Creating LoanStatus.lost variable where:
* “Lost” = Defaulted and Chargedoff loans
* “Not Lost” = all other loan statuses

  • little variation in reported employment status. 70% report “employed” or “full-time”.
    ***

  • Most common income range is $25-50k
  • Even at the highest income levels there are still a fair number of lost loans

Borrower delinquencies

Plotting delinquency variables including zero values on left, excluding zero values on right.

Delinquency measures have extreme outliers. The mean amount delinquent is 968 while the max is 444745.

Borrower income and credit

  • Monthly income, debt to income ratio, and current credit lines all have long right-hand tails.
  • FirstRecordedCreditLine could be a proxy for age. Many borrowers had their first credit lines in the 1990s so they are probably in their mid-30s to mid-40s.

  • There are a large number of occupations with the most common being “other” so it may be difficult to draw firm conclusions on the traits of specific ones.

Plots of Loan Features

  • Vast majority of loans have 36-month term
  • BorrowerRate and LenderYield have similar distribution (one is derived from the other)
  • Investment from friends is negligible with mean = 45.2 and third quartile = 0.

Most loans are $5000 or less.

Loan performance

  • 50% of defaulting loans do so within the first 16 months
  • Not much principal is recovered by collections. Median Net Principal Loss is $2733 compared to $2779 Gross Principal Loss.
  • LP_ServiceFees and LP_CollectionFees are given as negative numbers

  • Not many cancelled loans, removing them from dataset

Loan loss rate

Loan status plot shows a fair number of defaulted or charged off loans. How to calculate the base rate for lost loans?
- Including open loans could deflate the true loss rate because open loans may yet default.
- Including all closed loans could inflate the loss rate by including loans that failed before the end of their term while excluding non-failed loans from the same cohort that are still open.
- Approach chosen: calculate loss rate using completed loan cohorts based on expected completion date (LoanOriginationDate + Term) less than max data date (2014-03-10)

Eliminate open loans

Calculate expected term end date

Subset data on completed loan cohorts

Loan loss rate

Results:
- Surprisingly high 31% of loans from completed cohorts are charged off or in default.
Curious: What proportion of loans closed before their expected end of term date? 75%

I wonder if early repayment has any effect on investor return.
- creating AgeAtClosingDays and AgeAtClosingMos variables to show time to loan completion in days and months, respectively

## Source: local data frame [4 x 5]
## Groups: Term [?]
## 
##    Term LoanStatus.lost MeanAgeAtClose MedianAgeAtClose     n
##   (int)          (fctr)          (dbl)            (dbl) (int)
## 1    12        Not Lost            8.8               11  1420
## 2    12            Lost            9.9               10    80
## 3    36        Not Lost           25.0               30 25531
## 4    36            Lost           18.2               16 12119

For 36-month loans:
- Lost loans fail early: median age at closing is only 16 months
- Most completed loans close early too: median age at closing is 30 months

Observations:
- The most common repayment month for completed loans is at the full term with the rest distributed fairly evenly across the earlier months.
- Both completed and lost loans have a number of loans that continue beyond the end of the official loan term. These loans are probably marginal performers, with some managing to pay off the loan in the extra time and others failing.

Univariate Analysis

What is the structure of your dataset?

  • Original dataset comprises 81 variables and 113,937 loan observations
  • Refined dataset comprises 34 variables and 39,155 observations
  • Seven variables are factors and two are dates
  • Loan term can be 12 or 36 months
  • Loan amounts range from $1000 - $25,000
  • 75% of loans are $7,500 or less
  • 31% of loans are charged off or in default
  • Loan closing dates range from 2005-11-25 to 2014-03-10
  • Mean borrower rate = 18.8%

What is/are the main feature(s) of interest in your dataset?

I am intrigued by the high proportion of loans that close early and am focusing on the impact of early loan closing on investor return.

Features involved in investor return: + LoanOriginalAmount + LoanStatus + LP_ServiceFees + LP_CollectionFees + LP_CustomerPayments + LP_CustomerPrincipalPayments + LP_GrossPrincipalLoss + LP_InterestandFees + LP_NetPrincipalLoss + LP_NonPrincipalRecoverypayments

Features involved in early repayment: + LoanOriginationDate + Term + ClosedDate

Features involved in expected yield: + LenderYield + BorrowerRate

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

It would be interesting to look for borrower characteristics that predict early repayment. The features I will examine are: Occupation, EmploymentStatus, CreditScoreRangeUpper, FirstRecordedCreditLine, CurrentCreditLines, CurrentDelinquencies, IncomeRange.

Did you create any new variables from existing variables in the dataset?

New Variable Type Definition
LoanStatus.lost Factor “Lost” = defaulted or charged-off loan; “Not Lost” = all other loan statuses
Term.EndDate Date Expected end date of loan based on origination date and term
AgeAtClosingDays num Age of loan at closing date (days)
AgeAtClosingMos num Age of loan at closing date (months)

After some data cleaning in the next section these additional variables will be derived:

New Variable Type Definition
PrincipalGap num Difference between original loan amount and the sum of customer principal payments + gross principal loss
RecoveredPrincipal num Difference between gross principal loss and net principal loss
InvestorGain num Difference between investor’s income and expenses from loan. May be positive or negative.
InvestorReturn num InvestorGain as a percentage of average daily principal balance
InvestorAPY num Investor annual percentage yield (APY)

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Many features were skewed right because they represented real quantities with a large number of zero values. For example, most people had no delinquencies in the past 7 years, but a few had more than 10. I used log or sqrt transformations on these variables to condense and normalize the distributions for plotting. Zero values become ‘NaN’ in R when transformed by logarithm, so these values do not appear in the resulting plots. If the number of zero values in the data was small, I used a sqrt transformation so as to keep them in the plot. If the data was dominated by zero values, I used a log transformation.

Bivariate Plots

In order to examine the effect of early repayment on investor return, I first need to derive return-related metrics from the loan data.

Calculating Investor Return

The main benchmark of loan performance is annual percentage yield (APY). APY is used because it allows comparisons of different investments by standardizing the return with respect to time. In calculating APY I referred to the following definitions:
Gain from Investopedia
Return according to Prosper Loan definition
Annual Percentage Yield (APY) from Investopedia

The dataset’s loan performance (LP) variables provide sufficient detail to estimate gain, return, and APY for each loan observation.

Before I calculate the new variables, I will examine bivariate plots of LP variables to make sure that the data tally up properly.

Customer payments before charge-off

Total customer payment equals the sum of customer principal payments plus interest and fees, as expected.

Principal payments and losses

The sum of pre-chargeoff principal payments and gross principal loss should equal 100% of the original loan amount.

  • This plot shows that some loans don’t tally properly: there is a difference between the original loan principal and the principal flows recorded during the payback period.
  • Most totals are within +/-10% of the original loan amount but some tens of observations fall outside that range.

Let’s look at the principal gap separated out by loan status:

The data diverges for both types of loans. Let’s drill down into the LP_GrossPrincipalLoss variable to look for clues.

Expected patterns:
* fully repaid loans (Not Lost) ==> gross principal loss = zero
* charged off loans (Lost) ==> gross principal loss > 0

For fully repaid loans, GrossPrincipalLoss is zero as it should be. For lost loans, there’s no reference point against which to measure the accuracy of GrossPrincipalLoss.

Let’s look at CustomerPrincipalPayments next.

For fully repaid loans, customer principal payments should equal the original loan amount. For lost loans, customer principal should be less than the original loan amount.

  • In this case the data for lost loans looks pretty good: customer principal payments are less than or equal to 102% of loans for all observations
  • The Not Lost plot shows many data points where customer principal payments are underreported and a few where they are overreported. Because principal payments are submitted over many months, there are many opportunities for errors to get into the payments data.

Let’s look at the Principal Gap before making any decisions about how to clean the data.

Distribution of Principal Gap Errors

First I will create a new variable called PrincipalGap as the difference between the original loan amount and the sum of customer principal payments + gross principal loss. When the original loan amount exceeds the payment total, the gap will be positive. When the original loan amount is less than the payment total, the gap is negative.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -5000       0       0       9       0   12500

The principal gap takes both positive and negative values. Let’s plot the gaps.

There are 14691 loans with a Principal Gap but only 1130 where the gap is greater than 1% of the original loan amount.

Are there any visible correlations between principal gap and early repayment?

No pattern that I can see. Dropping observations with large principal gaps should not affect my analysis of the effects of early repayment.

Cleaning the payments data

In order to improve the validity of the data without losing too many data points, I decided to take two cleaning steps:
1. Drop all data points where Principal Gap > 1% of the original loan amount
2. Adjust customer principal payments to eliminate the gap for the remaining observations

Check for remaining principal gap in cleaned data.

Elimination of principal gap confirmed.

Now let’s check gross and net principal loss. NetPrincipalLoss should always be greater than or equal to zero, and can’t exceed 100% of GrossPrincipalLoss.

The “Not Lost” plot is blank because there are no principal losses for fully repaid loans. The “Lost” plot shows that net principal loss is generally between 0% and 100% of gross principal loss, as expected. But, there are still a number of observations outside of these bounds:
- 1721 observations where net loss is greater than gross loss
- 56 observations where net loss is given as a negative number
- 1 observations where gross loss is given as a negative number

The vast majority of the differences are less than $1 - rounding errors. I will drop points with differences more than $1 and adjust the remainder to eliminate negative differences.

Now let’s calculate RecoveredPrincipal as the difference between net and gross principal loss.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0      40       0   17800

Final check for well-behaved payments and loss data.

Now that the principal loss data has been cleaned, investor return can be calculated.

Prosper’s methodology for calculating return states: To calculate the Return, all payments received on borrower loans, net of principal repayment, credit losses, and servicing costs for such loans, are aggregated and then divided by the average daily amount of aggregate outstanding principal.

To use Prosper’s definition of return I need to estimate the average daily principal balance for each loan. This won’t be 100% accurate for loans with late payments, but it’s a reasonable approximation based on the available data. I’ll discuss the drawbacks of using this definition in the question section.

Create AvgPrincipal and InvestorReturn variables

Calculate InvestorAPY

Plots of Investor Return

Because there are relatively few 12-month loans in the data, I am going to remove them to simplify the analysis.

Create BorrowerRate buckets

While most lost loans result in losses to the investor they show a range of outcomes from more than $20,000 lost to gains of more than $5000. Both plots show a steep exponentially declining curve for positive gains. Negative gains (losses) also look to be declining exponentially, but not as steeply as positive gains.

Investor APY vs Loan Age at Closing

Observations:
* InvestorAPY is highly stratified by BorrowerRate within completed loans
* APY range narrows as loan age increases

  • In a somewhat counterintuitive finding, some of the highest APYs for both completed and charged-off loans are achieved by loans that close in the first 10-20 months.
  • Another striking thing about this plot is how the range of returns for lost loans varies from total loss (-100%) to nearly 100%. How can returns be so high for loans where the principal is not completely repaid? I suspect that will be related to the interaction of multiple factors, so I will look into that question further in the multivariable section.

Observations:
* By definition, LenderYield is equal to the interest rate on the loan less the servicing fee. So this plot shows the relationship between borrower interest rate and investor return, with an offset based on the size of the loan servicing fee.
* The relationship between LenderYield and InvestorReturn is not very straightforward. The scatterplots have a roughly triangular shape bounded on top by an upward-sloping line. This shows that for the same initial LenderYield, investor return can take on any of many values up to a certain maximum. Could this be related to the returns from different loan amounts associated with a given borrower interest rate?

Observations:
* Completed and lost loans seem to reflect opposite effects of early repayment. Among completed loans, early closing loans have the highest APY whereas among lost loans they have the lowest.

Plots Against Borrower Characteristics

Completed loans

Observations:
* InvestorAPY has very low correlation with AgeAtClosingMos (-0.037).
* FirstRecordedCreditLine needs to be converted from factor to numeric before running ggpairs. Otherwise with >11,000 factor levels it will take a long time!
* CreditScoreRangeUpper has a high negative correlation with APY of -0.514.

Observations:
* Not seeing a lot of variation in the boxplot of APY by Income bracket, at least at this scale.
* Debt-to-Income ratio is not highly correlated with either Loan Age or APY.

Observations:
* Both current delinquencies and delinquencies over last 7 years have a positive correlation with APY of about 0.2. I suspect that is driven by the higher interest rates given to borrowers with poorer credit histories. Since the dataset for these plots excludes defaulted loans, the true correlation of delinquencies with loan performance is distorted.

The Occupation factor has too many levels to create a readable plot in ggpairs, so I am examining it in a separate boxplot.

  • Occupations most associated with early repayment are Investor and Homemaker (by median age of loan repayment)
  • Sample size is very small in both cases: only 21 unique Investors and 31 unique Homemakers
  • Military personnel (enlisted and officers) also have a tendency to repay early
  • Occupations least associated with early repayment are Judge, Pharmacist, Psychologist, Student - Community College and Tradesman - Carpenter

Charged off and defaulted loans

  • InvestorAPY has a 0.834 correlation with Age at Closing for lost loans.

  • There are many outliers in the CurrentDelinquecies by Income Range boxplot.

Bivariate Questions

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Early loan closing has a different relationship with Investor APY based on whether the loan completed or defaulted. For defaulted loans, APY generally increases with age at loan closing. For completed loans, the opposite is true.

Among borrower characteristics, certain occupations such as homemaker and military were more associated with early repayment. Generally the correlations between the borrower characteristics examined and loan age at repayment were not very strong.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I was surprised to find some relationships between occupation and early repayment. Interestingly, borrowers who categorized themselves as “investors” tended to repay loans early. Pharmacists and scientists had the highest median percentage of principal repaid among lost loans. It could be that these relationships exist because certain occupations are correlated with higher income levels, and that is driving the repayment trends.

What was the strongest relationship you found?

Other than correlations driven by the financial algebra (where one variable is an input into another, such as percent of principal repaid and investor return), the strongest relationship I found was a -0.514 correlation between Investor APY and borrower’s credit score (upper range). Why higher credit scores should be associated with lower APY is a bit of a head-scratcher. I think the relationship may be mediated by borrower interest rates. Higher credit scores earn lower interest rates, which generate lower yields.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

There were clear problems with some of the loan repayment data, where total principal payments and losses failed to equal the original principal. I did a signficant amount of cleaning in the form of dropping data that was significantly out of the correct range and adjusting data that had smaller deviations along the lines of rounding errors. These steps were necessary to reduce distortion in the performance metrics that I derived from the data.

The financial performance metrics that I derived included investor gain, return, and APY. These variables were necessary in order to explore loan performance. Because Prosper Loans calculates interest payments and loan servicing fees on a compounded (i.e. periodic) basis, getting an exact measure of investor return using their definition is actually a bit complicated. Prosper called for return to be calculated based on the average daily principal balance. This is actually a good approach because it takes into account the fact that the investor gets some of the principal returned with each payment. However, the dataset doesn’t contain all the information needed to determine the average daily principal balance accurately. I made an estimate for each loan based on the assumption that the principal was repaid according to the standard payment schedule. This is clearly not the case for many defaulted loans. The errors that this methodology caused were particularly problematic when the total investor loss turned out to be more than the estimated average daily principal balance (a situation that basically blew up my formulas). This removed a small percentage of my data with ‘NA’ values and caused other estimates to be somewhat inaccurate.

As a result of these issues, my estimates of investor return from the loan data are not completely accurate. More error is expected for lost loans than for fully repaid loans because of the greater likelihood of variation in principal repayment among the former. If this exercise were part of a real business project I would spend more time developing an approach to address these issues, but for the purpose of this assignment I am simply noting the problems and the limitations of my approach here.

Multivariate Plots

Multivariable Interaction Plots

Previous plots showed that some lost loans can generate high investor APY of 50-100%. My guess is that high interest rates and penalty fees are responsible for these positive returns, so have created plots to look into that.

The next plot shows actual interest and fees paid by borrowers as a percentage of the original loan amount. Each tile shows the distribution for a range of borrower interest rates.

The lengths of the right-hand tails, representing loans where the interest and fees paid were equivalent to increasingly high percentages of the original loan amount, appear to increase as the interest rate buckets get higher. This could definitely contribute to positive investor returns even when the principal isn’t fully repaid.

This plot shows a number of trends that characterize lost loans:
- Investor return generally increases the longer loan payments continue - Within each loan age subgroup, higher investor return corresponds to higher interest and fee payments - The defaulted loans seen in previous plots with high returns of more than 100% are outliers and represent a very small proportion of total loans

On the other hand the driver for positive return among defaulted loans may be the percentage of principal repaid - let’s check.

Median percent of principal repaid for lost loans with positive return is 67% compared to 14% for loans with negative return. That must be what is driving the wide spread in returns for defaulted loans repaid early.

  • This plot confirms that higher returns on defaulted loans are clearly associated with the percent of principal repaid.
  • It suggests that when the percent of principal repaid is low and investor return is still positive, the borrower interest rate is high.
  • There are some data points where the percent of principal paid is more than 100%, so there is some noise in the data.

Linear Regression Models

Now that I’ve identified early repayment, percent of principal repaid and borrower rate as important variables in determining APY on the basis of data visualizations, I will run regression models on InvestorAPY to confirm and quantify these relationships. There will be models for three sets of data: the first model includes data from all loans, while the second and third utilize data from completed and lost loans, respectively.

First model independent variables are:
- Percent of principal repaid
- Borrower interest rate
- Interest and fees as a percentage of loan amount
- Loan age at closing

## 
## Calls:
## m1: lm(formula = InvestorAPY ~ pct.prin.repaid, data = pldCohort)
## m2: lm(formula = InvestorAPY ~ pct.prin.repaid + BorrowerRate, data = pldCohort)
## m3: lm(formula = InvestorAPY ~ pct.prin.repaid + BorrowerRate + IntFees.pct, 
##     data = pldCohort)
## m4: lm(formula = InvestorAPY ~ pct.prin.repaid + BorrowerRate + IntFees.pct + 
##     AgeAtClosingMos, data = pldCohort)
## 
## ============================================================
##                      m1         m2         m3         m4    
## ------------------------------------------------------------
## (Intercept)      -0.860***  -1.077***  -1.010***  -0.931*** 
##                  (0.002)    (0.002)    (0.002)    (0.002)   
## pct.prin.repaid   1.020***   1.080***   1.015***   1.040*** 
##                  (0.002)    (0.002)    (0.002)    (0.001)   
## BorrowerRate                 0.901***   0.251***  -0.060*** 
##                             (0.007)    (0.008)    (0.009)   
## IntFees.pct                             0.467***   0.715*** 
##                                        (0.004)    (0.005)   
## AgeAtClosingMos                                   -0.004*** 
##                                                   (0.000)   
## ------------------------------------------------------------
## R-squared             0.880      0.917      0.941      0.949
## adj. R-squared        0.880      0.917      0.941      0.949
## sigma                 0.126      0.105      0.088      0.082
## F                261691.138 196976.573 187817.266 166529.725
## p                     0.000      0.000      0.000      0.000
## Log-likelihood    23309.073  29853.444  35767.629  38580.597
## Deviance            561.119    388.319    278.430    237.685
## AIC              -46612.145 -59698.889 -71525.258 -77149.193
## BIC              -46586.708 -59664.973 -71482.863 -77098.320
## N                 35557      35557      35557      35557    
## ============================================================

Observations:
- The coefficients of all explanatory variables and the intercept are different from zero at a significance level of p < .01
- Repaid principal accounts for most of the variation in Investor APY with an R-squared of 0.88.
- Adding the borrower interest rate increases R-squared to 0.916
- Adding interest and fees increases R-squared to 0.94
- Adding loan age at closing increases R-squared to 0.949

Now let’s look at the determinants of APY for successfully completed loans only. I will drop percent of principal repaid since it is 100% for all loans in this category. Because the coefficient on AgeAtClosingMos is smaller than the number of default decimal places (<0.000) I have multiplied APY by 100 in this model.

Second model independent variables are:
* Borrower interest rate
* Interest and fees as a percentage of loan amount
* Loan age at closing

## 
## Calls:
## n1: lm(formula = InvestorAPY * 100 ~ BorrowerRate, data = subset(pldCohort, 
##     LoanStatus.lost == "Not Lost"))
## n2: lm(formula = InvestorAPY * 100 ~ BorrowerRate + IntFees.pct, 
##     data = subset(pldCohort, LoanStatus.lost == "Not Lost"))
## n3: lm(formula = InvestorAPY * 100 ~ BorrowerRate + IntFees.pct + 
##     AgeAtClosingMos, data = subset(pldCohort, LoanStatus.lost == 
##     "Not Lost"))
## 
## =================================================
##                      n1         n2         n3    
## -------------------------------------------------
## (Intercept)       0.020     -0.121***   0.776*** 
##                  (0.035)    (0.033)    (0.065)   
## BorrowerRate     83.054***  92.077***  87.353*** 
##                  (0.180)    (0.263)    (0.392)   
## IntFees.pct                 -6.321***  -2.795*** 
##                             (0.138)    (0.258)   
## AgeAtClosingMos                        -0.035*** 
##                                        (0.002)   
## -------------------------------------------------
## R-squared             0.895      0.903      0.904
## adj. R-squared        0.895      0.903      0.904
## sigma                 2.276      2.186      2.175
## F                212044.122 115956.396  78197.557
## p                     0.000      0.000      0.000
## Log-likelihood   -55819.832 -54816.643 -54686.917
## Deviance         128981.911 118998.890 117765.683
## AIC              111645.663 109641.285 109383.833
## BIC              111670.032 109673.777 109424.448
## N                 24906      24906      24906    
## =================================================

This model has an R-squared of 0.904 and all coefficients are significantly different from zero at p < 0.01.

The third model is run on charged off and defaulted loans only. The independent variables are:
- Percent of principal repaid
- Borrower interest rate
- Interest and fees as a percentage of loan amount
- Loan age at closing

## 
## Calls:
## o1: lm(formula = InvestorAPY ~ pct.prin.repaid, data = subset(pldCohort, 
##     LoanStatus.lost == "Lost"))
## o2: lm(formula = InvestorAPY ~ pct.prin.repaid + BorrowerRate, data = subset(pldCohort, 
##     LoanStatus.lost == "Lost"))
## o3: lm(formula = InvestorAPY ~ pct.prin.repaid + BorrowerRate + IntFees.pct, 
##     data = subset(pldCohort, LoanStatus.lost == "Lost"))
## o4: lm(formula = InvestorAPY ~ pct.prin.repaid + BorrowerRate + IntFees.pct + 
##     AgeAtClosingMos, data = subset(pldCohort, LoanStatus.lost == 
##     "Lost"))
## 
## ============================================================
##                      o1         o2         o3         o4    
## ------------------------------------------------------------
## (Intercept)      -0.941***  -1.198***  -1.034***  -1.084*** 
##                  (0.003)    (0.006)    (0.004)    (0.004)   
## pct.prin.repaid   1.359***   1.424***   1.047***   0.837*** 
##                  (0.007)    (0.006)    (0.005)    (0.009)   
## BorrowerRate                 1.074***  -0.089***  -0.119*** 
##                             (0.022)    (0.017)    (0.017)   
## IntFees.pct                             0.922***   0.919*** 
##                                        (0.008)    (0.007)   
## AgeAtClosingMos                                    0.007*** 
##                                                   (0.000)   
## ------------------------------------------------------------
## R-squared            0.799      0.836       0.930      0.935
## adj. R-squared       0.799      0.836       0.930      0.935
## sigma                0.178      0.161       0.105      0.101
## F                42428.285  27197.604   47329.170  38224.242
## p                    0.000      0.000       0.000      0.000
## Log-likelihood    3286.844   4370.033    8913.155   9281.272
## Deviance           336.414    274.498     116.963    109.151
## AIC              -6567.688  -8732.066  -17816.310 -18550.543
## BIC              -6545.868  -8702.972  -17779.943 -18506.903
## N                10651      10651       10651      10651    
## ============================================================

This model has an R-squared of 0.934. The coefficient on borrower interest rate is negative, indicating higher rates are associated with lower APY. This could be due to the higher risk of default associated with borrowers that are given higher rates.

Financial Formulas

Since the formulas governing InvestorAPY are known from Prosper Loan materials and general financial definitions, I can determine what the effects of early repayment should be from a theoretical perspective and use this information to help interpret the loan data.

Below is a function to approximate cumulative interest payments, investor service fees and average daily principal balance given inputs for principal, interest rate, loan term, and final repayment date.

Assumptions include:
- Interest and investor service fees accrue monthly (Prosper loans accrue interest and fees daily, so my model is simplifying that aspect)
- Investor service fee of 1% of outstanding principal balance each payment period

Simulated loan data for two scenarios:
- 36-mo, $5000 loan at 20% interest rate
- 36-mo, $10000 loan at 13.5% interest rate

Predicted vs Actual Relationships

This plot illustrates the basic intuition behind the advantage of early repayment to the investor. By the midpoint of the loan term at 18 months, more than half of the total interest has been paid but less than half of the principal has been recovered. If the borrower repays the loan early, the investor can reinvest all of the principal in another loan and thereby increase APY. Since APY declines continuously with time, the model suggests that the earlier the repayment, the better for the investor.

There are a few loans at 36 months that have higher return than average and lower service fees, but there are still low-fee loans sprinkled among those with returns below the predicted curve. Part of this could be due to the fact that I included loans with interest rates between 13% and 14% in order to increase the number of data points, whereas the predictive model is based on exactly 13.5%.

  • There are a number of data points that fall well below the predicted APY curve. Not sure why that is. Doesn’t appear to be connected to the service fees, as they are similar for points on and below the curve.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Borrower interest rate and percent of principal repaid both were major contributors to investor APY and reinforced each other’s effects in the case of lost loans. Since percent of principal repaid increases with the number of payments made over time, this variable probably accounts for some of the correlation that I observed in the initial scatterplots between loan age at closing and investor APY.

Were there any interesting or surprising interactions between features?

Not really. Since the variables of interest were driven by predictable algebraic formulas the relationships were pretty much expected once I decomposed the data properly.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I create two types of model: OLS linear regression models and a top-down model based on the financial formulas governing loan payments. Given my focus on the effects of early repayment on investor APY, the financial formulas provide the better model because they allow for very specific and accurate predictions of APY based on interest rate, loan term, loan amount and time to repayment.

What the models I developed don’t address is the risk element of the loan process. Now that I understand the mechanics of the loan process from a structural point of view, it would make sense to explore the data for insights on how borrower characteristics relate to the risk of default. For this question regression models would be a better choice than financial models since the relationships are not known in advance.

Final Plots and Summary

Plot One

Description One

At first glance, these plots of annual percentage yield (APY) don’t seem to make sense: the earlier the loans close, the wider the range of APY. Why would both the best and worst performing loans be associated with the earliest closing dates? Looking at successfully completed loans on the left, the upper bound of the data seems to slope downward as the months increase. On the right, loans that failed within the first 10 months have APYs ranging from -100% (total loss) to more than +75%. This plot sparked my investigation of the effect of early repayment on loan performance in the Prosper data.

Plot Two

Note: Plot depicts model results for $5000, 36-month loans @20% interest

Description Two

This plot shows the drivers of APY for fully repaid loans and illustrates the basic intuition behind the benefits of early repayment to the investor. By the midpoint of the loan term at 18 months, more than half of the total interest has been paid but less than half of the principal has been recovered. While return continues to increase so long as additional interest is paid over the three year loan term, the annualized yield decreases over time. If the borrower repays the loan early, the investor can reinvest all of the principal in another loan and thereby earn more on an annual basis.

The point markers on the plot show the actual performance of Prosper loans matching the modelled parameters ($5000, 36-month, fully repaid loans at 20% interest). The generally close fit of the data points around the predicted lines shows that the loans are behaving as expected, although there is some variation. The divergence from the predicted line is especially apparent for investor return at full term, 36 months, where we see multiple data points above and below the line. The variations in return most likely reflect real-world differences from the assumptions in the model, such as varying investor service fees (modeled at 1%) or penalty income from late payments, as well as noise (error) in the data.

Plot Three

Description Three

This graphic explores the main drivers of APY for charged-off and defaulted loans. The plot shows the same funnel-shaped pattern that was evident in Plot One: loans that close in months 0-9 have the widest range of APYs and those that close near the end of the term have the narrowest. Now, however, the reasons for this shape are clearer. There are some loans with a high proportion of principal, interest and fee payments in every age category, and these factors are driving APY. Even when the borrower defaults during the first nine months, investors can still make money from those borrowers who repaid a high proportion of principal along with significant interest and fees.

It is now possible to see that age at loan closing is associated with both positive and negative effects on investor APY. The earlier a loan closes, the higher its APY will be for the same amount of investor gain. But the later a loan closes, the more likely it is that a higher percentage of the original principal has been repaid. Loan age and percent of principal repaid are tightly associated, with a correlation of 0.905.

This plot also reminds us that the vast majority of defaulted loans are money losers for the investor. We can see that the highest APYs are mostly outliers. In fact, lost loans with positive APY constitute the top 14.5% of defaulted loans by performance; only 2.23% of defaulted loans achieve APY greater than 20%.

Reflection

When I first saw the high yields associated with early closing loans, it sparked my interest because I thought it might lead to an improved investor strategy for maximizing portfolio yields. Having done the analysis, I find that early repayment has a fairly small impact on the investor’s annualized yield. The correlation between loan age at closing and APY for fully repaid loans is only -0.037, and the regression model coefficient predicted that 10 additional months to repayment would result in a decrease in APY of only 0.35%. The model based on financial formulas shows that early repayment should increase the APY of fully-repaid loans, but in the real-world dataset this impact is unclear. Overall, the analysis indicates that investors would benefit from encouraging full, early repayment by borrowers.

Next steps in the analysis could focus on predicting the risk of default based on borrower characteristics. Default is the single largest risk to investors. The analysis I’ve done on early loan closing suggests that rather than modeling the risk of default as a binary variable, it could be modeled as a continuous variable ranging from earliest full principal repayment to earliest full principal default. The dependent variable could be an interactive term based on the percent of principal repaid and age of loan at closing. This term would represent the relative financial strength of the borrower on a continuous scale.